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Abstract 

Background: In recent years, many algorithms have been developed for network-based analysis of differential 
gene expression in complex diseases. These algorithms use protein-protein interaction (PPI) networks as an 
integrative framework and identify subnetworks that are coordinately dysregulated in the phenotype of interest. 

Motivation: While such dysregulated subnetworks have demonstrated significant improvement over individual 
gene markers for classifying phenotype, the current state-of-the-art in dysregulated subnetwork discovery is almost 
exclusively limited to binary phenotype classes. However, many clinical applications require identification of 
molecular markers for multiple classes. 

Approach: We consider the problem of discovering groups of genes whose expression signatures can discriminate 
multiple phenotype classes. We consider two alternate formulations of this problem (i) an all-vs-all approach that 
aims to discover subnetworks distinguishing all classes, (ii) a one-vs-all approach that aims to discover subnetworks 
distinguishing each class from the rest of the classes. For the one-vs-all formulation, we develop a set-cover based 
algorithm, which aims to identify groups of genes such that at least one gene in the group exhibits differential 
expression in the target class. 

Results: We test the proposed algorithms in the context of predicting stages of colorectal cancer. Our results show 
that the set-cover based algorithm identifying "stage-specific" subnetworks outperforms the all-vs-all approaches in 
classification. We also investigate the merits of utilizing PPI networks in the search for multiple markers, and show 
that, with correct parameter settings, network-guided search improves performance. Furthermore, we show that 
assessing statistical significance when selecting features greatly improves classification performance. 



Introduction 

Genome-wide monitoring of mRNA expression, moni- 
tored using DNA microarrays and more recently deep 
sequencing, has proved quite useful in understanding 
the mechanistic bases of complex human diseases. 
Systematic analysis of differential gene expression in dif- 
ferent phenotypic classes leads to identification of novel 
biomarkers, which serve as features for phenotype classi- 
fication, as well as targets for therapeutic intervention. In 
previous studies, differential analysis of gene expression 
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led to identification of biomarkers for a range of complex 
diseases, including Parkinson's disease [1], neuroblastoma 
[2], lung cancer [3] and breast cancer [4]. 

Traditional analyses generally take a univariate approach 
to study gene expression and identify genes with signifi- 
cant individual differential expression in the phenotype of 
interest. However, such univariate approaches are often 
limited in explaining the underlying mechanisms of com- 
plex diseases, which arise from the interplay among multi- 
ple genetic and environmental factors. For example, genes 
that cooperate or complement each other in pathogenesis 
may not necessarily be differentially expressed individually, 
but exhibit coordinated dysregulation when considered 
together. 
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In order to address the shortcomings of the univariate 
approaches, Chuang et al. develop an algorithm that 
integrates gene expression data with protein-protein 
interaction (PPI) networks to identify reproducible 
breast cancer metastasis markers composed of multiple 
interacting proteins ("dysregulated subnetworks") [5]. 
They show that these subnetwork markers better predict 
breast cancer metastasis as compared to individually 
dysregulated genes. Motivated by the demonstrated pro- 
mise of this approach, several other algorithms are 
developed for network-based analysis of differential gene 
expression. In particular, Chowdhury et al. develop a 
set-cover based heuristic for identification of genes that 
complement each other in discriminating phenotype and 
control samples [6]. Phuong et al. further improve on 
these algorithms by introducing a biclustering algorithm 
that also accounts for the noise in PPI networks by 
incorporating reliability scores for PPIs [7]. More 
recently, recognizing the shortcomings of greedy algo- 
rithms in identifying dysregulated subnetworks, Phuong 
et al. introduce a color-coding based randomized algo- 
rithm to identify subnetworks that are highly discrimi- 
native of phenotype and control [8]. These methods are 
also extended to the identification of subnetwork 
expression signatures that can shed light into the regula- 
tory logic of the relationship between the dysregulation 
of multiple genes and the disease phenoype. In particu- 
lar, Chowdury et al. identify subnetworks whose combi- 
natorial expression states are indicative of phenotype by 
using a branch-and-bound algorithm [9], Dutkowski 
et al. grow network-guided forests by training decision 
trees using interacting proteins [10]. 

All of the existing dysregulated subnetwork discovery 
algorithms are designed and validated for binary pheno- 
type classes [e.g. cancerous vs. non-cancerous, metastatic 
vs. non-metastatic, drug responders vs. non-responders) 
and prove to be promising in terms of accurate classifica- 
tion of samples. However, many progressive diseases 
such as glioblastoma, breast cancer and colorectal cancer 
require identification of molecular markers for multiple 
classes (such as the four stages in colorectal cancer 
according to Dukes' classification) for effective prognosis 
and treatment. This implies the necessity of a framework 
that can also work on datasets with more than two phe- 
notype classes for network-based discovery of disease 
markers. Although most of the existing algorithms can 
be applied to multiple phenotype classes in principle, no 
tool is readily available for this purpose. Furthermore, 
subnetwork discovery on multi-class datasets requires 
additional design choices and poses novel algorithmic 
challenges. These choices include designing criteria to 
evaluate the dysregulation of a subnetwork; i.e., are we 
interested in identifying subnetworks that can distinguish 
all classes from each other at once, or are we interested 



in identifying subnetworks that serve as indicators for 
specific classes. The algorithmic challenges, on the other 
hand, include unproportionately distributed samples 
across multiple classes. For these reasons, novel algo- 
rithms are needed that are robust and can work with 
datasets that are composed of different number of classes 
and sample distributions. 

Contributions of this study 

In this article, we introduce novel algorithms for net- 
work-based analysis of differential gene expression on 
applications that involve multiple phenotype classes. As 
an important application, we focus particularly on identi- 
fying subnetworks that can discriminate different stages 
of human colorectal cancer (CRC) according to Dukes' 
classification. We first propose two formulations to gen- 
eralize information-theoretic measures of subnetwork 
dysregulation to multiple phenotype classes. These for- 
mulations differ in terms of how the target subnetworks 
discriminate phenotype classes from each other; namely 
we establish information-theoretic criteria for all-vs-all 
and one-vs-all discriminative subnetworks. Then, we 
extend the set-cover based algorithm by Chowdury et al. 
, NETCOVER, to identify one-vs-all discriminative sub- 
networks [6]. We also introduce a framework for asses- 
sing the statistical significance of the sub-networks 
identified by the set-cover based algorithm. Using public 
CRC datasets composed of samples labeled with Dukes' 
four stages, we investigate the performance of the result- 
ing algorithm, COBALT, in identifying subnetworks that 
are useful in predicting the stages of colon cancer sam- 
ples. In particular we perform systematic computational 
experiments to investigate the following: 

♦ We compare the performance of all-vs-all and one- 
vs-all subnetworks in predicting phenotype and 
show that one-vs-all discriminative subnetworks are 
generally more reliable as features for classification. 

♦ We investigate the effect of using the PPI network 
to confine the space for searching groups of genes 
that are coordinately dysregulated subnetworks. We 
show that, while expansion of the search space 
through consideration of indirect interactions 
improve the classification performance of identified 
subnetworks, this improvement saturates after a 
point, demonstrating that PPI networks indeed pro- 
vide a shortcut to the identification of dysregulated 
groups of genes. We also show that our efficient set 
cover based algorithm renders network-free search 
feasible. 

♦ We investigate the effect of using statistically sig- 
nificant subnetworks (as opposed to high-scoring 
subnetworks) as features for classification and show 
that assessment of statistical significance facilitates 
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identification of more useful subnetwork features for 
classification. 

In the next section, we start our discussion by propos- 
ing two alternate information-theoretic formulations of 
sub-network dysregulation. We also introduce our set- 
cover based algorithm, COBALT, for the identification of 
one-vs-all discriminative subnetworks and propose meth- 
ods for assessing the statistical significance of the identi- 
fied subnetworks. Subsequently, in Results Section, we 
provide comprehensive experimental results on the clas- 
sification performance of the subnetworks discovered by 
COBALT in predicting the stage of CRC on two gene 
expression datasets obtained from the Gene Expression 
Omnibus. We conclude the paper in Conclusion Section. 

Methods 

In this section, we start by introducing the mathematical 

background of the information-theoretic formulation of 
coordinate dysregulation for a set of genes. Subsequently, 
we propose two alternate approaches for generalizing this 
notion to multiple phenotype classes. We then introduce 
COBALT, our set-cover based algorithm that is specifically 
designed to identify stage-specific discriminative subnet- 
works. Finally, we introduce a framework for assessing the 
statistical significance of the identified subnetworks, and 
describe how these subnetworks can be utilized for classi- 
fication of samples. 

Dysregulation of subnetworks 

For a given set V of genes and JJ of samples, let £, e 
i?'"' represent the properly normalized gene expression 
vector for gene e V , where £;(/) denotes the relative 
expression of gi in sample Sj €U . Assume that we have 
a set T, composed of different classes for the phenotype 
of interest (such as the four stages in colorectal cancer 
according to Dukes classification) and the phenotype 
vector C annotates each sample with one of the labels 
in T , i e-) C(j) = t where t&T -We also define the set 
of all samples for a specific phenotype class t as 
W'*' = {Sj&U: C{]) = t}. 

Let ^(V, e) denote a PPI network where the product 
of each gene gi g V is represented by a node and each 
edge gigj represents an interaction between the products 
of gi and gj. Given a PPI network and a gene expression 
dataset over multiple phenotype classes, we are inter- 
ested in finding sets of genes that can together discrimi- 
nate the phenotype classes with their gene expression 
signatures. In order to establish the functional relevance 
of these gene sets and search for these sets more effi- 
ciently, we confine the search space to PPI subnetworks, 
that is groups of proteins that are functionally interre- 
lated through PPIs. Formally, a set <S c V of proteins is 



considered a subnetwork of interest if for all proteins 
gi^ S , there is at least one other protein gj £ S such 
that gi and g, are connected through at most £ hops in 
the PPI network. Here, £ is a parameter that adjusts the 
trade-off between functional relevance and computa- 
tional efficiency; a larger £ allows searching for function- 
ally less related proteins at the cost of increasing the 
search space. 

For a given subnetwork S cy , Chuang et al. define 
the subnetwork activity of tSas £5 =X!&e'5WvT^> 
that is the aggregate expression profile of the genes in 
S [5]. Using subnetwork activity, they define an infor- 
mation-theoretic measure to quantify the dysregulation 
of a subnetwork. This "additive" definition of dysregula- 
tion limits the framework to the identification of subnet- 
works with all genes in the subnetwork dysregulated in 
the same direction (i.e., all up- or down-regulated in the 
phenotype of interest), and alternate approaches that 
compute combinatorial expression signatures are shown 
to be more powerful [9,10]. However, this additive for- 
mulation serves as a useful starting point to generalize 
subnetwork dysregulation to phenotypes that involve 
multiple classes. For this reason, we focus on additive 
subnetwork activity in this paper. 
All-vs-all discriminative power of a subnetworl( 
It is straightforward to generalize the information-theo- 
retic measure for the dysregulation of a subnetwork [5] 
to multiple phenotype classes. Namely, the mutual infor- 
mation between the subnetwork activity of S and the 
multi-class phenotype vector, i.e., A aii-vs-aii (S) = / (£s, 
C) = H{C) - H{C\Es ), provides a measure of the the 
reduction in the uncertainty about C given Es- Here, 

H(X) = — y^xexPM log(PW) denotes the Shannon 
entropy of discrete random variable X that can take 
over values from the set X . In our case, the support 
set for the random variable C is T> whereas the support 
set for the random variable Eg is obtained by appropri- 
ately quantizing the expression levels. 
One-vs-all discriminative power of a subnetwork 
Here, we propose an alternate measure to quantify the 
power of a subnetwork in discriminating multiple phe- 
notype classes from each other. This measure targets 
discriminating a particular phenotype class from all 
other classes. Namely, we define class-specific phenotype 
vector Cf-*^ for class t g 7" as 

it) _ fl ifQ = t |-,^ 
~ [0 otherwise. 

Then, the mutual information between subnetwork 
activity and the class-specific phenotype vector C''\ i.e., 

^one vs all =/(£5,C^''). provides a measure of the 
reduction in the uncertainty about class t given £5. This 
formulation offers a number of benefits as compared to 
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the all-vs-all formulation of discriminative power: (1) 
The one-vs-all formulation may lead to identification of 
more interpretable markers, since, for example, it can 
provide stage-specific molecular signatures for colorectal 
cancer. (2) The one-vs-all formulation can extract class- 
specific molecular signatures that may be missed by the 
all-vs-all formulation because they do not discriminate 
other classes well. (3) The phenotype random variable 
takes a smaller number of values, thus offering more 
statistical power for the same number of samples. The 
concepts of all-vs-all and one-vs-all disciminative power 
of subnetworks and the computation of mutual informa- 
tion using these different formulations are illustrated in 
Figure 1. 



Identifying one-vs-all discriminative subnetworks 

The problems of identifying subnetworks with maximum 

Aall - vs - all 

(S) or Aone-vs-all(S) are intractable [5]. How- 
ever, it is straightforward to generalize the greedy algo- 
rithm by Chuang et al. to solve both problems efficiently 
[5]. This greedy algorithm initializes a subnetwork with a 
single protein. It then grows the subnetwork by adding 
the protein in the neighborhood of the subnetwork (i.e., 
reachable from the subnetwork with £ hops) that 
improves the objective function ( Aaii _vs - all(S) or 
Aone -vs - all(S)) the most. The algorithm stops either 
when there is no more protein in the neighborhood to 
add, or the best improvement provided by a protein in 
the neighborhood is below a user-defined threshold. 
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Figure 1 Illustration of the difference between all-vs-all and one-vs-all discriminative subnetworks. Illustration of the difference between 
all-vs-all and one-vs-all discriminative subnetworks. The aggregate expression profiles of five hypothetical subnetworks are shown. Red and 
green respectively represent positive and negative expression, with intensity representing magnitude. The subnetworks Si, S2, S3 and 54 are 
one-vs-all discriminative, respectively indicating classes (CRC stages) A, B, C and D, since the expression profile of each subnetwork in samples 
that belong to the respective class can discriminate these samples from other classes. On the other hand, all-vs-all discriminative 

subnetwork, since it discriminates all classes from each other 
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While the explained greedy algorithm is quite effective 
in efficiently discovering high-scoring subnetworks, it 
has several drawbacks [6]. First, this algorithm is biased 
toward identifying subnetworks with very few proteins 
that exhibit high dysregulation individually. This is 
because the algorithm lacks global awareness, i.e., it will 
stop expanding the subnetwork when the best candidate 
protein to add to the subnetwork has only marginal 
individual contribution, but may actually contribute a 
greater deal when additonal proteins are added. Second, 
this approach requires computation of mutual informa- 
tion for each and every candidate protein in the neigh- 
bourhood to be added to the growing subnetwork, 
which may prove to be costly when the algorithm needs 
to be run multiple times to assess statistical significance 
of identified subnetworks. Motivated by these obser- 
vations, Chowdhury et al. develop a set-cover based 
algorithm, NETCOVER, which is more effective in dis- 
covering proteins that complement each other in discri- 
minating phenotype and control samples [6]. However, 
NETCOVER is designed for binary phenotype classes 
and it assumes that the samples are paired. Here, we 
argue that the algorithmic insights introduced by NET- 
COVER suit particularly well to the identification of 
one-vs-all discriminative subnetworks. Based on this 
observation, we develop COBALT, which generalizes 
NETCOVER to handle unpaired samples and multiple 
phenotype classes to identify one-vs-all discriminative 
subnetworks. 

COBALT:Cover-based algorithm for identifying one-vs-all 
discriminative subnetworks 

Recall that a one-vs-all discriminative subnetwork is 
defined as one with differential subnetwork activity in a 
specific phenotype class, as compared to all other classes. 
Since subnetwork activity is defined regularly, the genes 
in such a subnetwork have to be either all up-regulated 
or all down-regulated in the phenotype class of interest. 
Motivated by this observation, COBALT aims to identify 
subnetworks such that for each sample that belongs to 
the phenotype class of interest, there exists at least one 
gene in the subnetwork that is up-regulated (or down- 
regulated) in that sample. Such subnetworks are said to 
"cover" the entire patient population that represents the 
phenotype class of interest. 

In order to identify the genes that are up-regulated or 
down-regulated in each sample, we use the expression 
of that gene in all samples as the background distribu- 
tion. Subsequently, we identify samples in which the 
genes' expression deviates significantly from this back- 
ground distribution. Namely, for a gene gi, consider the 
distribution of the expression values £, across all sam- 
ples. We compute a quantized expression value for gi in 
sample s, as follows: 



+ 1 if£iO) > 
-1 ifEjQ) < iii 
0 otherwise 



+ a * (Ti 
— a * fTi 



(2) 



Here, {ii and Oi respectively represent the mean and 
standard deviation of expression value of gi across all sam- 



ples, i.e., /^i = Y^SieuEi{j)/\U\ and ai = ^J^sM^m- mf- 

We define a as a user-defined threshold parameter for a 
gene's dysregulation in a sample to be considered signifi- 
cant. We say that a gene gi positively covers a sample Sj if 
Ei{i) = +1 , and negatively covers Sj if = — 1 , Follow- 
ing this definition, for a given gene gi and phenotype class 
t e 7~> we define the positive cover set 'p^'^'^ as the set of 
all samples with phenotype t that are positively covered by 
gi, i.e., -Pf = {sj eU :€(]) = t andB;0') = Similarly, 

the negative cover set J\f^^^ contains all samples in class t 

that are negatively covered by gene gi. We illustrate these 
concepts in Figure 2. It can be shown that the mutual 
information of C**' and £, is a monotonically non-decreas- 
ing function of the cardinalities of p^'' and JV^*^'' [6], 
i.e., one can maximize I{Ei,C^*^) by maximizing 

Using the negative and positive cover sets for each 
gene-class pair, COBALT identifies one-vs-all discrimi- 
native subnetworks for each phenotype class by using a 
greedy heuristic that is shown to be effective for the set- 
cover problem [11]. Namely, for each gene gi, we first 
identify the target phenotype class for that gene as the 
phenotype class with largest percentage of samples posi- 
tively (or negatively) covered by gi. Subsequently, we 
grow a subnetwork by systematically adding a gene in 
the neighborhood based on the coverage on the rest of 
the uncovered samples for that class. Without loss of 
generality, the algorithm identifies a minimal positive 
covering subnetwork seeded at gene gi as follows: 

1. Initialize subnetwork: S; <— {g/} 

2. Define the target phenotype class t for this subnet- 
work as the class that has the maximum fraction of 
samples positively covered by gc. t <— argmax^' ^ T 

3. Initialize the set of uncovered samples for class t: 

4. Initialize the set of network neighbors: 
Q^{g;eV:5fe,g;)<£} 

5. For all genes gj e Q , compute 
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Figure 2 Illustration of the set-cover based algorithm for the identification of one-vs-all discriminative subnetworks Here, the coverage 
provided by a hypothetical three-gene subnetwork is shown. In the left panel, the distribution of the expression levels of each gene across all 
samples is shown. We first compute the mean (fj/) and the standard deviation (cr,) of this distribution for each gene g,-. Subsequently, we identify 
samples that are positively or negatively covered by each gene. A gene g, with expression greater than a* a, in a sample is said to 
positively cover that sample, while a gene g,- with expression less than a* a, in a sample is said to negatively cover that sample (we set a - 
2 in our experiments). The negative and positive cover sets for each gene and the subnetwork composed of g,, g-i and are shown on the 
right. In this example, this subnetwork negatively covers (all samples in) Stage B. 



6. Find all the genes (can be multiple) in Q with 
maximum \{'pf'')' \ - I (A/j''')'! and let be the 
gene among these genes with minimum 

r ^ Sk has minimum positive back- 
ground coverage). 



7. Expand the subnetwork with gf:. Si <— Si U {g^} 

8. Update the set of uncovered positive samples for 
class t a^'-'^ ^ \ Vk'^ 

9. Update set of neighbouring genes: Q <— Q U {g, e 

V ■.s{g,,gj)<e}\{gk} 

10. If Q = 0 or M^'' = 0, return 5,-; otherwise, go to 
step (5). 
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This algorithm is also used for identifying the minimal 
negative covering subnetwork seeded at gi by simply 
replacing Q with V above. 

Assessing statistical significance of subnetworks 

In order to asses the significance of the identified sub- 
networks, we perform two distinct significance tests. 
Each significance test is performed by generating an 
empirical background distribution that carefully 
accounts for multiple hypothesis testing. The first back- 
ground distribution is obtained by randomly permuting 
the class labels. The second background distribution, on 
the other hand, is obtained by permuting the gene 
expression profiles (the rows of the gene expression 
matrix, thereby randomly reestablishing the relationship 
between the expression profiles and the nodes in the 
PPI network ). After generating a large number of these 
randomized datasets, we use COBALT to identify class- 
specific subnetworks for the randomized datasets as 
well. We then use these subnetworks as the background 
distribution to test the statistical significance of the dis- 
criminative power of the stage-specific subnetworks 
identified on the actual dataset. This approach implicitly 
handles multiple hypothesis testing, since the back- 
ground distribution is constructed using the most discri- 
minative subnetworks that could be identified on each 
randomized dataset. 

Note also that the cover provided by a subnetwork for a 
target phenotype class depends on the size of the sub- 
network {i.e., the number of proteins in the subnetwork). 
In other words, if we construct subnetworks at random, 
we would expect larger subnetworks to have a higher 
coverage. Furthermore, in our experiments, we also 
observe that larger subnetworks tend to have higher dis- 
criminative power (A ). Motivated by these insights, we 
assess the statistical significance of a subnetwork as a 
function of its size. For this purpose, we stratify the sub- 
networks that compose each background distribution 
according to subnetwork size and compute the p-vzhie of 
a subnetwork S by comparing A(S') to the discriminative 
power of the background subnetworks that have similar 
size to the subnetwork of interest. More precisely, the 
/^-value of S is defined as the fraction of subnetworks 
with discriminative power greater than that of S among 
all subnetworks in the background set with size equal to 
that of S. A minimal covering subnetwork S discovered 
by COBALT is considered to be statistically significant if 
its p-va\ue is less than the significance threshold for both 
background populations. 

Using identified subnetworks for classification 

One application of identifying subnetworks that can dis- 
criminate multiple phenotype classes is to predict the 
phenotype class of a test sample using the expression 



profiles of these subnetworks. This application also pro- 
vides a useful means for assessing the biological rele- 
vance, reproducibility, and utility of the identified 
subnetworks. In order to use stage-specific subnetworks 
in colorectal cancer to predict the stage of a patient, 
following steps are performed: 

1. We first identify both the positive and negative 
covering subnetworks for each gene ^, e V ■ 

2. In order to investigate the effect of statistical sig- 
nificance on the classification utility of subnetworks, 
we use two alternate strategies to extract a list of 
features from the sets of covering subnetworks 
found in (1): 

(a) The first approach assumes that high-scoring 
subnetworks are more useful for classification, as 
compared to significantly discriminative subnet- 
works. For each phenotype class t e T< vfe sort 
the subnetworks based on their all-vs-all 

(Aall — vs— all 

(S)) or one-vs-all (Aaii 

— VS— all 

(S)) dis- 
criminative power. We then choose the top k 
positive and negative covering subnetworks for 
each phenotype class, giving us a total of 2*A:*|7"| 
features to be used in classification. Here, A: is a 
user defined parameter to set the number of 
stage-specific subnetworks that are used for each 
class. 

(b) The second approach assumes that assess- 
ment of statistical significance will facilitate 
selection of biologically more meaningful subnet- 
works, also providing more power in classifica- 
tion as compared to high-scoring subnetworks. 
For this purpose, using the two proposed statisti- 
cal significance tests discussed in the previous 
section, we identify subnetworks that are signfi- 
cant according to both statistical tests and use all 
of these significantly discriminative subnetworks 
as features for classification. 

3. Once we obtain a final list of features either 
using (a) or (b) at step (2), we compute the aggre- 
gate expression profiles (£5 ) for each of these 
selected subnetworks and use these to construct 
feature vectors for each sample (where each fea- 
ture represents the aggreate expression of one 
subnetwork). 

4. Finally, we use these feature vectors to train 
and test classifiers for predicting the class of the 
phenotype of interest. 

Results and discussion 

In this section, we first give brief information about the 
colorectal cancer in human (CRC) and introduce the 
two stage-specific CRC datasets and the PPI network we 
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use in our experiments. Subsequently, we describe in 
detail the experimental framework used. After introdu- 
cing the performance evaluation metrics used, we pre- 
sent our experimental results comparing one-vs-all and 
all-vs-all discriminative subnetworks, as well as the addi- 
tive and set-cover based algorithms that are used to dis- 
cover these subnetworks. Next, we analyze the effect of 
the network distance parameter {£) that adjusts the 
search space size when growing the subnetworks. 
Finally, we compare the performance of high-scoring 
and statistically significant subnetworks in predicting the 
stages of samples. 

Human colorectal cancer 

Colorectal Cancer (CRC) is one of the most common 
causes of cancer related deaths in the western civiliza- 
tion [12]. Diagnosis of CRC is often difficult as the 
symptoms appear only at the advanced stages of the dis- 
ease. Moreover, early diagnosis is very critical as the 
survival rate changes dramatically with the stage of the 
cancer. In fact, 5 year survival rates when diagnosis is 
made at the localized stage (cancer is confined in the 
primary site) and after cancer has metastasized are 
around 90% and 12% respectively [13]. These observa- 
tions suggest that, for effective diagnosis, prognosis and 
treatment, accurate determination of disease stage is 
crucial. 

There are different classification systems for the pro- 
gression of colorectal cancer. Dukes' famous staging sys- 
tem classifies patients based on how far the cancer is 
spread [14]. TNM is another staging method providing 
a more comprehensive framework including information 
about the size and localization of the tumor, as well as 
the involvement of lymph nodes [15]. CRC Datasets we 
use in our experiments are classified by Dukes' staging 
system. 

Datasets 

We use two CRC microarray datasets obtained from the 
Gene Expression Omnibus [16] in our experiments. 
These datasets that contain labeled CRC samples with 
Dukes' 4-stage classification are the following: 

♦ GSE14333 contains the expression profiles of 
54675 genes in 290 samples. 

♦ GSE5206 contains the expression profiles of the 
same 54675 genes in 98 samples. 

The distribution of the samples in each dataset with 
respect to CRC stage is shown in Table 1. 

The human protein-protein interaction data used in 
our experiments is obtained from NCBI Entrez Gene 
Database [17]. This database integrates interaction data 
from several other databases available, such as HPRD, 



Table 1 Number of samples labeled with each colorectal 
cancer stage based on Dukes' 4-stage classification in the 
datasets used in our experiments. 



stage 


A 


B 


C 


D 


total 


GSE14333 


44 


94 


91 


61 


290 


GSE5205 


12 


32 


33 


21 


98 



Bi-oGrid, and BIND. We remove the nodes with no 
interactions to obtain a final PPI network that contains 
8959 proteins and 33,528 interactions among these 
proteins. 

Experimental design 

COBALT is fully implemented in Matlab. We use this 
implementation to perform the following classification 
experiments: 

♦ Prediction of disease stage in GSE14333. Subnet- 
works discovered using GSE14333 are used to pre- 
dict the stages of samples in the same dataset in a 
10-fold cross validation setting. Samples in each phe- 
notype class are randomly separated into ten similar- 
sized groups. In each iteration, one of the groups in 
each class is chosen to be the test data and the rest 
of the data is used to train the classifier. 

♦ Prediction of disease stage in GSE5206. Subnet- 
works discovered using GSE14333 are used to pre- 
dict the stages of samples in GSE5206. In this cross- 
classification setting, the classifier is trained on 
GSE14333 and tested on the other dataset, GSE5206. 

For both of these settings, we use a naive Bayesian clas- 
sifier provided by Matlab's classify function. Using other 
classifier options provided by Matlab's classifier proce- 
dure only marginally effects the results (data not shown). 

Classification performance 

In order to obtain a comprehensive picture of the perfor- 
mance of different approaches, we list the precision and 
recall of the classification experiments for each pheno- 
type class separately. Precision refers to the percentage of 
the correct predictions over all samples predicted belong- 
ing to the respective CRC stage, whereas recall refers to 
the percentage of the correctly predicted samples over all 
samples that are clinically diagnosed to belong the 
respective CRC stage. Please note that we set the network 
distance parameter £ = 3 in all experiments unless other- 
wise noted, since it provides the best performance as 
shown in the next section. When quantizing the expres- 
sion values of a gene over all samples using Equation 2, 
we set a = 2 as the threshold parameter for the gene's 
dysregulation in a sample to be considered positively or 
negatively covering. 
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We compare COBALT with our implementation of the 
two additive greedy approaches explained in detail in the 
Methods section, namely additiveova and additiveava- 
additivegya and additiveava refer respectively to the algo- 
rithms that aim to identify one-vs-all and all-vs-all sub- 
networks by greedily maximizing the discriminative 
power of the subnetworks (Aone-vs-aii and Aaii^vs-aii)- 
In the first set of experiments, we use the 10-fold cross 
validation framework for prediction of disease stage of 
samples in GSE14333, using the high-scoring subnet- 
works extracted as features from the same dataset, i.e., 
we use the top scoring positive and negative subnetworks 
for each stage in COBALT (setting k = 1), top 2 subnet- 
works for each stage in additiveova and top 8 subnet- 
works for additiveava as features (a total of 8 features 
used in each method). 

As shown in Figure 3, COBALT provides better perfor- 
mance compared to both of the additive approaches in 
terms of the precision in prediction for all CRC stages. It 
also provides better recall values for all CRC stages 
except for samples in Stage A. COBALT achieves 0.84 
weighted average precision over all stages where as addi- 
tiveova and additiveava respectively achieve 0.71 and 0.62 
precision. Similarly, COBALT outperforms others by 
achieving 0.84 weighted average recall over all stages 
where as additiveova and additiveava respectively provide 
0.69 and 0.60 recall. 

The effect of the PPI network on classification performance 

In this section, we discuss the effect of the PPI network 
in the classification performance of subnetworks identi- 
fied by COBALT. Since the use of the PPI network con- 
fines the search space to functionally related groups of 
proteins, these experiments provide insights into whether 
these functional constraints also improve the biological 



reproducibility and the utility of identified stage-specific 
subnetworks. For this purpose, we systematically evaluate 
the classification performance of the subnetworks for 
varying network distance parameter {€) that adjusts the 
search space size when growing the subnetworks using 
COBALT. We also compare the subnetworks identified 
by the network-guided algorithm with groups of genes 
that are identified by using the same algorithm in a net- 
work-free fashion. The set-cover based algorithm imple- 
mented by COBALT is quite efficient, therefore a 
network-free search for stage- specific groups of proteins 
is feasible. 

In the PPI network free approach, the next protein to 
be added to the subnetwork does not need to be in a cer- 
tain proximity {i.e., I is effectively set to oo) to the pro- 
teins already in the subnetwork. This increases the search 
space for the algorithm, thus making it infeasible for 
most of the state-of-the-art algorithms to perform some 
complex analyses such as statistical significance compu- 
tations. The effect of the parameter £ on the classification 
performance for each CRC stage is shown in terms of 
precision and recall in Figures 4 (a) and 4(b) respectively. 
As seen in the figures, the classification performance 
(hence the reproducibility) of identified subnetworks 
improves as PPI network neighborhood is defined more 
flexibly. This is expected, since the PPI network is incom- 
plete, thus consideration of indirect interactions accounts 
for missing interactions to a certain extent. However, as 
the search diameter reaches 3, the classification perfor- 
mance saturates and adding more flexibility to the search 
does not improve performance any more. This observa- 
tion suggests that incorporation of PPI networks is useful 
for increased efficiency of the search, as well as identifica- 
tion of more reproducible subnetworks. Thus we set 




Colorectal Cancer Stages Colorectal Cancer Stages 



(a) Stage specific precision values <b) Stage specific recall values 

Figure 3 Precision and recall values for each stage in prediction of stages of samples in GSE14333. Precision and recall values for each 
stage in prediction of stages of samples in GSE14333 in a 10-fold cross validation framework are shown in (a) and (b) respectively. COBALT 
outperforms the additive approaches in terms of precision for all CRC stages. It also provides better recall values for all CRC stages except for 
Stage A. 
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Figure 4 Precision and recall plots of each stage with respect to different values of the network distance parameter £ Precision and 
recall plots of eacli stage, in prediction of CRC stages of samples in GSE14333 in a 10-fold cross validation framework are shown in (a) and (b) 
respectively, with respect to different values of the network distance parameter i, as well as the PPI-free approach. Setting £ = 3 provides the 
best performance for most of the cancer stages in terms of both precision and recall, also providing a smaller search space compared to £ = 4 
and the PPI-free approach. 



£ = 3 as it is the most reasonable choice in terms of both 
the classification performance and the computational 
efficiency of the algorithms. 

The effect of using statistically significant subnetworks on 
classification performance 

In this section, we compare the classification perfor- 
mance of high-scoring subnetworks to that of statisti- 
cally significant subnetworks. On the GSE14333 dataset, 
COBALT identifies 9 statistically significant subnetworks 
with 139 unique genes (please see Additional File 1 for 
the list of genes and the covered stage of colorectal can- 
cer for these 9 subnetworks). Using these subnetworks 
as features, we predict the stage of the samples in 
GSE14333 in a 10-fold cross validation framework as 
previously explained. We choose the same number of 
features with top (Aone -vs-aiit'?). score and compare 
the classification performance of both approaches. 
Stage-specific precision and recall values for both 
approaches are shown in Table 2. As seen in the table, 
utilizing statistical significance computations when 
choosing features improve performance for predicting 
the stages of patients in GSE14333, with less number of 
unique genes used. 

In the cross-classification framework, we use the statisti- 
cally significant features identified in GSE14333 to pre- 
dict the classes of samples in the GSE5206 dataset, i.e., 
the classifier is trained using GSE14333 and tested on 
GSE5206. The stage-specific precision and recall values 
are shown in Table 3. The weighted average precision 
and recall values are 0.57 and 0.56 respectively. 

Conclusions 

In this article, we have proposed two alternate formula- 
tions of the discriminative power of subnetworks when 



working on multi-class phenotypes, namely, one-vs-all 
and all-vs-all. We then introduced our cover-based algo- 
rithm for network-guided disease marker discovery, for 
identifying subnetworks with one-vs-all discriminative 
power. Moreover, we have introduced a framework for 
assessing the statistical significance of the identified sub- 
networks. Systematic experiments on real multi-staged 
CRC datasets show that the proposed algorithm outper- 
forms the additive algorithms in terms of providing 
higher precision and recall in prediction of sample 



Table 2 Contingency tables for prediction of CRC stages 
of samples in GSE14333. 

Predicted Classes 





stage 


A 


B 


C 


D 


recall 


Actual Classes 


A 


37 


3 


0 


4 


0.84 




B 


3 


74 


11 


6 


0.78 




C 


5 


5 


77 


4 


0.84 




D 


4 


7 


3 


47 


0.82 




precision 


0.75 


0.83 


0.84 


0.77 










Predicted Classes 








stage 


A 


B 


C 


D 


recall 




A 


38 


1 


2 


3 


0.86 


Actual Classes 


B 


4 


75 


8 


7 


0.79 




C 


3 


2 


83 


3 


0.91 



52 



0.85 



precision 



O.E 



0.92 0.85 



0.£ 



Contingency tables for prediction of CRC stages of samples in GSE14333 with 
COBALT using (a) 9 statistically significant features composed of 139 unique 
genes, (b) 9 features with top mutual information scores composed of 144 
unique genes. The weighted average precision scores are 0.86 and 0.81 for 
(a) and (b) respectively. Similarly, weighted average recall values are 0.85 and 
0.81 respectively for (a) and (b). 
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Table 3 Contingency table for prediction of CRC stages 
of samples in GSE5206 using the statistically significant 
features identified from GSE14333. 



Predicted Classes 



stage 


A 


B 


C 


D 


recall 


A 






1 


0 


0,,b 


Actual Classes B 


3 


18 


8 


3 


0.56 


C 


5 


3 


20 


5 


0.50 


D 


4 


3 


6 


8 


0.38 


precision 


0.42 


0.59 


0.57 


0.50 





stages. The efficient implementation of the cover-based 
algorithm enabled us to show that using statistically sig- 
nificant subnetworks as features improves classification 
performance compared to using same number of high- 
scoring subnetworks (in terms of mutual information 
with respect to the phenotype vector). We have also 
shown that guiding the subnetwork discovery search 
with the PPI network identifies subnetworks that are 
more informative (in terms of classification power) than 
the net-works identified without the PPI network. We 
have also investigated the impact of different values of 
the network distance parameter, £, and concluded that 
using £ = 3 is the most reasonable choice in terms of 
both classification performance and computational 
efficiency. 

Additional material 



Additional file 1: List of 9 statistically significant subnetworks 
identified for GSE14333 dataset. All the gene products in the 9 
statistically significant subnetworks identified for GSE14333 dataset are 
listed, as well as the covered colorectal cancer stage and cover direction 
of the corresponding subnetworks. Please note that these gene products 
might not be direct neighbours in the PPI network, as we set the 
network distance parameter £ = 3 in the experiments. 
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